Biopolymer automatic identifying method

ABSTRACT

The invention aims to provide a highly accurate automatic biopolymer determination technique utilizing mass spectrometry whereby calibration prior to measurement or the addition of an internal standard to a sample can be eliminated. The biopolymer automatic identifying method of the invention comprises: retrieving a candidate molecule by matching an observed mass value X obtained by mass spectrometry with a predetermined database; selecting an arbitrary number of candidate molecules with high similarity scores; calibrating the observed mass value X using the candidate molecule as an internal standard; calculating relative error Ec between a calibrated mass value Xc and a theoretical mass value M of the candidate molecule; determining the standard deviation S_(Ec) of the relative error; determining a tolerance Tc of database search from the standard deviation S_(Ec); and repeating a database search based on the tolerance Tc.

TECHNICAL FIELD

The present invention relates to a biopolymer identifying technology utilizing mass spectrometry, and more specifically, to a biopolymer automatic identifying method capable of improving the accuracy of mass data obtained by mass spectrometry.

BACKGROUND ART

Mass spectrometry is an instrumental analysis technique whereby sample molecules are ionized and then separated in accordance with the mass/charge ratio (m/z) for detection. Using this technique, qualitative analysis can be performed based on the resultant mass spectrum, and quantitative analysis can be performed based on ion quantities.

The mass spectrometer (“MS”) used for such a measurement of molecular mass roughly consists of an ionization unit (ion source) for ionizing a sample, an analyzer for separating ions in accordance with the mass/charge ratio m/z (m: mass, and z: charge number), a detection unit (detector) for detecting separated ions, and a data analysis unit.

When subjecting sample molecules to mass spectrometry using the aforementioned mass spectrometer, the mass spectrometer must be calibrated prior to measurement. Specifically, since errors might be introduced into the measurement by the mass spectrometer due to factors such as temperature changes, voltage accuracies, and electric circuit noise, a calibration procedure must be carried out prior to the start of measurement. In the calibration procedure, the chromatograph or the like is removed from the mass spectrometer, and a predetermined mass-calibration standard substance is introduced into the mass spectrometer so as to obtain an observed mass value. The observed mass value is compared with a known theoretical mass value, and the apparatus is adjusted such that no systematic error occurs in mass values (a calibration procedure according to the external standard method).

If an even higher accuracy of mass values is to be obtained, an additional calibration procedure must be performed, whereby a known substance is mixed in the sample and its mass is measured, and the actual measurement value is adjusted based on the mass value (a calibration procedure according to the internal standard method).

In general, identification of biopolymers, such as peptides or proteins, using a mass spectrometer (including the tandem mass spectrometer) involves a procedure referred to as a database search (or a library search). In this procedure, the observed mass value of an unknown sample molecule obtained by mass spectrometry is searched for by matching with a database (library) in which the primary structures or sequences of approximately 100,000 kinds of molecules are stored. Expected reference (standard) spectra are calculated from the structure information, and molecules whose spectra are similar to that of the unknown molecule under investigation are allocated scores and selected. Candidate molecules are thus narrowed and listed, thereby eventually identifying the unknown sample molecule.

However, the above-described mass spectrometer calibration procedure is very troublesome work, requires much adjustment time, and is primarily responsible for the drop in work efficiency in the conventional mass measurement operation. Namely, it has been impossible to carry out a measurement operation with high efficiency based on a continuous operation of the mass spectrometer (without calibration). Further, in a measurement system employing a plurality of mass spectrometers, it has been extremely difficult to achieve uniform accuracy and reliability in the individual apparatuses even if they are calibrated individually according to the external standard.

In the case of the external standard calibration, it has been impossible, using the conventional process of database search as described above, to eliminate from the measurement data the influence of measurement errors produced in the mass spectrometer by the external environment. Particularly, even those measurement errors due to subtle temperature changes (on the order of 0.2° C.) in the measurement environment could not be ignored in some cases.

Furthermore, when a complex biopolymer mixture is measured by the conventional internal standard calibration method, the internal standard substance and the ion signals from the sample are superposed, which prevents ion analysis. Thus, it has been difficult to select the type or concentration of the substance that is put into the sample as the internal standard. In order to achieve high mass accuracy for a wide range of masses, it has been necessary to introduce a number of internal standard substances.

Also, human confirmation of each identification result has been necessary, as the identification reliability has been low. Recent progress in mass spectrometry, however, has made direct analysis of increasingly complex biopolymer mixtures possible. This has resulted in huge volumes of data that could not possibly be individually confirmed by eye. Therefore, there has been a need to develop a highly reliable automatic identification technique for the analysis of complex biopolymer mixtures.

DISCLOSURE OF THE INVENTION

It is therefore an object of the invention to provide a highly accurate and reliable method for automatically identifying biopolymers that is based solely on data processing and that eliminates the need for calibration of the mass spectrometer prior to measurement or the addition of an internal standard to the sample in advance.

In order to achieve the aforementioned object, the invention provides a biopolymer automatic identifying method implementing the following procedures (1)-(7):

(1) a mass measurement procedure for measuring the mass of a biopolymer in a sample by mass spectrometry;
(2) a database search procedure for searching a predetermined database for candidate molecules by matching an observed mass value obtained by said mass measurement procedure with the predetermined database;
(3) a candidate molecule selection procedure for selecting an arbitrary number of candidate molecules having a high similarity score;
(4) a mass value calibration procedure for calibrating the observed mass value using the candidate molecules as an internal reference;
(5) a procedure for calculating relative error between a calibrated mass value of a candidate molecule obtained in a previous procedure and a theoretical mass value in order to determine the standard deviation of such relative error;
(6) a procedure for determining the tolerance (allowable error) of the database search procedure based on the standard deviation; and
(7) a procedure for repeating the database search procedure on the basis of the tolerance.

The term “database” herein refers to a database of molecular structures or sequences.

The mass value calibration procedure (4) may comprise a procedure in which relative error between an actual measurement value and a theoretical mass value of a candidate molecule selected by the candidate molecule selection procedure is calculated and a systematic error in the observed mass value is estimated by creating a least square line (a line expressed by the equation y=a×M+b, where M is the theoretical mass value) based on the plots of the theoretical mass value and the relative error, and a procedure in which the observed mass value is calibrated by subtracting the systematic error from all of the measurement values.

For example, in the case of a time-of-flight mass spectrometer, the systematic error of a candidate molecule is determined from the aforementioned least square line. The systematic error is then subtracted from all of the actual measurement values. Specifically, the equation (Xc−M)/M=(X−M)/M−(aM+b), where X is an observed mass value, Xc is a calibrated mass value, and M is a theoretical mass value, is modified to Xc=X−M(aM+b).

Although the theoretical mass value M is given for the candidate molecule, it is not given for all of the actual measurement values. Therefore, if all of the actual measurement values are to be calibrated, the term M(aM+b) in the above equation must be approximated by an actual measurement value. The values of a and b are generally much smaller than those of X and Xc, such that M(aM+b)≈Xc(aX+b). Substituting this into the above equation yields Xc=X−Xc(aX+b), which can be modified to obtain Xc=X/(1+(aX+b)), based on which all of the observed mass values can be calibrated.
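By way of illustration only, the final rearrangement can be written out step by step, using only the symbols already defined (X, Xc, a and b). The remark that follows the derivation is an order-of-magnitude estimate and assumes relative errors of roughly 100 ppm.

```latex
\begin{align*}
X_c &= X - X_c\,(aX + b) \\
X_c\,\bigl(1 + (aX + b)\bigr) &= X \\
X_c &= \frac{X}{1 + (aX + b)}
\end{align*}
```

Under that assumption, aX+b is of the order of 10^-4, so replacing M(aM+b) with Xc(aX+b) alters the result only by terms of second order in that quantity.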

In accordance with the biopolymer automatic identifying method of the invention as described above, very accurate mass values can be obtained from complex biopolymer mixtures solely by data processing. The high accuracy of the resultant mass values makes it possible to identify and determine the biopolymers more unambiguously. Thus, the invention provides a highly reliable automatic identifying method capable of analyzing complex biopolymer mixtures.

The invention also provides information recording media, such as a CD-ROM, in which program information for causing a computer system to carry out the individual procedures constituting the above-described biopolymer automatic identifying method is stored.

The aforementioned means makes it possible to eliminate the calibration operation of the mass spectrometer prior to measurement and the addition of an internal standard to the sample in advance. It also allows the biopolymer automatic identifying method to be implemented with high accuracy and reliability based solely on data processing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the relationship between the mass value (m/z) identified in Example 1 and error.

FIG. 2 shows the result of identification prior to mass calibration in Example 2.

FIG. 3 shows the result of identification after mass calibration in Example 2.

FIG. 4 shows the relationship between the mass value (m/z) identified in Example 2 and error.

BEST MODE FOR CARRYING OUT THE INVENTION

A preferred embodiment of the biopolymer automatic identifying method in accordance with the invention will be described. It should be obvious, however, that the invention is not limited by the following embodiment.

The mass of an unknown biopolymer in a sample is initially measured by a conventional mass spectrometry method depending on purpose, thereby obtaining an observed mass value X. The mass spectrometry method may employ a tandem mass spectrometer, for example, which consists of a plurality of analyzers coupled in tandem. Specifically, in the tandem mass spectrometer, a particular ion (a parent ion) in a mixture is selected by the initial analyzer, and a collision dissociation is performed between the thus selected ion and an inert gas in the next analyzer. Then, a dissociated ion (generated ion) indicating the internal structure information is subjected to mass spectrometry by the final analyzer.

An observed mass value X obtained by the above mass measurement procedure is converted into a format (a binary file: mass value and intensity) that can be read by conventional database search engines. The thus converted value is then matched with a database in which a number of molecules with known mass values are stored, so as to search for a candidate molecule that could possibly be the unknown biopolymer under investigation.
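The role of the tolerance in such a matching step may be illustrated by the following minimal sketch. It is not how any particular search engine works: real engines also score fragment spectra, whereas this sketch filters on mass alone. The function name `match_within_tolerance`, the entry names, and all mass values are hypothetical.

```python
from typing import Dict, List, Tuple


def match_within_tolerance(observed: float,
                           database: Dict[str, float],
                           tol_ppm: float) -> List[Tuple[str, float]]:
    """Return (name, relative error in ppm) for every entry whose theoretical
    mass M satisfies |X - M| / M <= tol_ppm * 1e-6, smallest error first."""
    hits = []
    for name, theoretical in database.items():
        error_ppm = (observed - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tol_ppm:
            hits.append((name, error_ppm))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical three-entry database; the masses are made-up numbers.
toy_db = {"peptide_A": 1000.000, "peptide_B": 1000.200, "peptide_C": 1500.000}

# An observed value that is peptide_A plus an assumed 170 ppm systematic error.
hits = match_within_tolerance(1000.170, toy_db, tol_ppm=250.0)
```

In this made-up case the wide tolerance admits both peptide_A and peptide_B, and the uncorrected systematic error makes peptide_B look like the closer match.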

For the conversion of the observed mass value X, any of the generally available types of software provided by the mass spectrometer manufacturers, such as MassLynx (from Micromass), may be appropriately utilized. The database search may be appropriately carried out by using any commercially available database software, such as Mascot (from Matrix Science).

From the results of the database search procedure, an arbitrary number of candidate molecules (or a set thereof) with high similarity scores are selected. The size n of the set may be any number large enough to render statistical processing possible.

Thereafter, the relative error E between the observed mass value X and its theoretical mass value M of each of the candidate molecules selected by the above candidate molecule selection procedure is calculated in accordance with the following equation (1):

E=(X−M)/M  (1)

A mean value m_(E) of the thus obtained relative error E is then calculated in accordance with the following equation (2):

m_(E)=Σ(E)/n  (2)

The standard deviation S_(E) of the relative error E is then calculated by the following equation (3):

S_(E)={Σ(E−m_(E))²/(n−1)}^(1/2)  (3)

Using this standard deviation, it is determined whether or not it is appropriate to use a particular candidate molecule for the internal standard. When S_(E)<m_(E), the calibration is determined to be valid.
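Equations (1) to (3) and the validity criterion can be sketched as follows. This is a minimal illustration only; the function names and the four candidate masses and errors are hypothetical values chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def relative_errors(observed: Sequence[float],
                    theoretical: Sequence[float]) -> List[float]:
    """Equation (1): E = (X - M) / M for each candidate molecule."""
    return [(x - m) / m for x, m in zip(observed, theoretical)]


def mean_and_std(errors: Sequence[float]) -> Tuple[float, float]:
    """Equations (2) and (3): mean m_E and standard deviation S_E (n - 1)."""
    n = len(errors)
    if n < 2:
        raise ValueError("at least two candidate molecules are required")
    m_e = sum(errors) / n
    s_e = math.sqrt(sum((e - m_e) ** 2 for e in errors) / (n - 1))
    return m_e, s_e


# Hypothetical candidates: theoretical masses and assumed errors in ppm.
theoretical = [500.0, 800.0, 1200.0, 1600.0]
assumed_ppm = [160.0, 168.0, 172.0, 180.0]
observed = [m * (1.0 + p * 1e-6) for m, p in zip(theoretical, assumed_ppm)]

errors = relative_errors(observed, theoretical)
m_e, s_e = mean_and_std(errors)
calibration_is_valid = s_e < m_e  # criterion S_E < m_E stated in the text
```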

The magnitude of the systematic error is then estimated and subtracted from the observed mass value X, thereby obtaining a calibrated mass value Xc. For example, in the case of a time-of-flight mass spectrometer, the systematic error of the candidate molecule can be determined from the least square line y=aM+b with respect to the plots of the theoretical mass value and the relative error, in the following procedure. When the relative error after the calibration of the candidate molecule is Ec=(Xc−M)/M, Ec=E−(aM+b). Therefore:

(Xc−M)/M=(X−M)/M−(aM+b)  (4)

where X is an observed mass value, Xc is a calibrated mass value, and M is a theoretical mass value.

Specifically, the above equation (4) is modified to obtain the following equation (5):

Xc=X−M(aM+b)  (5)

It is noted that although the theoretical mass value is given for the candidate molecule, it is not given for all of the actual measurement values. Therefore, in order to calibrate all of the actual measurement values, the term “M(aM+b)” in the equation (5) must be approximated by an actual measurement value. The values of a and b are generally much smaller than those of X and Xc, such that M(aM+b)≈Xc(aX+b). Substituting this into equation (5) yields the following equation (6):

Xc=X−Xc(aX+b)  (6)

This equation (6) is modified to obtain the following equation (7):

Xc=X/(1+(aX+b))  (7)

based on which all of the observed mass values are calibrated.
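A minimal sketch of equation (7) follows. The coefficients a and b and the mass value are hypothetical; a is taken as the coefficient that multiplies the mass and b as the constant term of the line y=aM+b.

```python
def calibrate(observed_mass: float, a: float, b: float) -> float:
    """Equation (7): Xc = X / (1 + (a*X + b))."""
    return observed_mass / (1.0 + (a * observed_mass + b))


# Hypothetical coefficients: a per unit mass, b dimensionless.
a, b = 2e-8, 150e-6
theoretical = 1000.0

# Observed value carrying exactly the systematic error a*M + b (170 ppm here).
observed = theoretical * (1.0 + a * theoretical + b)

calibrated = calibrate(observed, a, b)
residual = (calibrated - theoretical) / theoretical  # relative error after calibration
```

With these assumed numbers the residual left by the approximation M(aM+b)≈Xc(aX+b) is far below 1 ppm, whereas the uncalibrated error is 170 ppm.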

The values of a and b in the aforementioned least square line y=aM+b can be determined from the following equations (8) and (9), respectively:

a=Σ{(M−m_(M))×(E−m_(E))}/Σ{(M−m_(M))²}  (8)

b=m_(E)−a×m_(M)  (9)

The value of m_(M), which is the mean value of the theoretical mass value M of the candidate molecule, can be determined from the following equation (10):

m_(M)=Σ(M)/n  (10)
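The least square fit can be sketched as below. This sketch treats a as the coefficient that multiplies M (the slope) and b as the constant term, in keeping with the line y=aM+b; the function name and the three data points are hypothetical.

```python
from typing import Sequence, Tuple


def fit_error_line(theoretical: Sequence[float],
                   errors: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of E against M for the line y = a*M + b."""
    n = len(theoretical)
    m_m = sum(theoretical) / n  # mean theoretical mass, equation (10)
    m_e = sum(errors) / n       # mean relative error, equation (2)
    sxx = sum((m - m_m) ** 2 for m in theoretical)
    if sxx == 0.0:
        raise ValueError("candidate masses must not all be identical")
    sxy = sum((m - m_m) * (e - m_e) for m, e in zip(theoretical, errors))
    a = sxy / sxx               # slope
    b = m_e - a * m_m           # intercept
    return a, b


# Hypothetical candidates whose errors lie exactly on one line.
masses = [500.0, 1000.0, 1500.0]
errors = [160e-6, 170e-6, 180e-6]
a, b = fit_error_line(masses, errors)
```

For these made-up points the error grows by 10 ppm per 500 mass units, so the fit gives a slope of 2e-8 per mass unit and an intercept of 150 ppm.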

The relative error Ec between the mass value Xc after mass calibration and the theoretical mass value M can be determined from the following equation (11):

Ec=E−(aM+b)  (11)

Thereafter, the mean value m_(Ec) of the relative error Ec=(Xc−M)/M obtained for the candidate molecule and the standard deviation S_(Ec) are determined from the following equations (12) and (13), respectively:

m_(Ec)=Σ(Ec)/n  (12)

S_(Ec)={Σ(Ec−m_(Ec))²/(n−1)}^(1/2)  (13)

Based on the thus obtained mean value m_(Ec), the calibration is evaluated. Ideally, m_(Ec)=0. Tolerance Tc for a database search is then calculated based on the standard deviation S_(Ec), using the following equation (14):

Tc=K×S_(Ec)  (14)

where K is 1.5 to 3.0, thereby completing the above-described series of calibration procedures.

In the above equation (14), K is an empirical constant for designating the confidence interval of the mass value. The K value can be appropriately determined depending on the accuracy of the software used for the database search. The higher the identification performance of the database search software, the closer K can be to 3, where a 99.7% confidence interval can be obtained. In the case of Mascot (Matrix Science) database software, K=1.5 can be empirically employed.
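The tolerance calculation can be sketched as follows. The function name and the residual values are hypothetical; the tolerance comes out in the same unit as the residuals passed in (ppm in this example).

```python
import math
from typing import Sequence


def search_tolerance(calibrated_errors: Sequence[float], k: float = 1.5) -> float:
    """Tc = K * S_Ec, with S_Ec the (n - 1) standard deviation of Ec."""
    if not 1.5 <= k <= 3.0:
        raise ValueError("K is expected to lie between 1.5 and 3.0")
    n = len(calibrated_errors)
    if n < 2:
        raise ValueError("at least two candidate molecules are required")
    m_ec = sum(calibrated_errors) / n
    s_ec = math.sqrt(sum((e - m_ec) ** 2 for e in calibrated_errors) / (n - 1))
    return k * s_ec


# Hypothetical residual errors after calibration, in ppm.
residuals_ppm = [-10.0, -2.0, 2.0, 10.0]
tc_low = search_tolerance(residuals_ppm, k=1.5)   # lower end of the K range
tc_high = search_tolerance(residuals_ppm, k=3.0)  # upper end of the K range
```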

Based on the resultant tolerance Tc (Tc₁), the same database search is conducted once again. As needed, the above-described series of calibration and database search procedures are repeated a plurality of times so as to narrow the range of the tolerance Tc (T→Tc₁→Tc₂→...) gradually, thereby enhancing the candidate molecule selection accuracy. Tc₁ indicates the tolerance obtained by the initial calibration operation, and Tc₂ indicates the tolerance obtained by the second calibration operation.
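One possible reading of this repetition is sketched below, under several assumptions that the text does not fix: each round searches the values calibrated in the previous round, the tolerance is a relative error, and the search engine is replaced by a hypothetical stand-in (`toy_search`) that matches on mass alone. All masses and errors are made-up numbers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Match = Tuple[float, float]  # (mass value submitted to the search, theoretical M)
SearchFn = Callable[[Sequence[float], float], List[Match]]


def refine(observed: Sequence[float], search: SearchFn,
           initial_tol: float, k: float = 1.5, rounds: int = 2):
    """Repeat search, line fit, calibration and tolerance update."""
    values = list(observed)
    history = [initial_tol]  # T, Tc1, Tc2, ...
    for _ in range(rounds):
        matches = search(values, history[-1])
        n = len(matches)
        if n < 3:  # too few candidates for a line fit plus a deviation
            break
        masses = [m for _, m in matches]
        errors = [(x - m) / m for x, m in matches]
        m_m, m_e = sum(masses) / n, sum(errors) / n
        sxx = sum((m - m_m) ** 2 for m in masses)
        if sxx == 0.0:
            break
        a = sum((m - m_m) * (e - m_e) for m, e in zip(masses, errors)) / sxx
        b = m_e - a * m_m
        ec = [e - (a * m + b) for m, e in zip(masses, errors)]
        m_ec = sum(ec) / n
        s_ec = math.sqrt(sum((e - m_ec) ** 2 for e in ec) / (n - 1))
        history.append(k * s_ec)                            # Tc = K * S_Ec
        values = [x / (1.0 + (a * x + b)) for x in values]  # Xc = X/(1+(aX+b))
    return values, history


# Hypothetical stand-in for a search engine: nearest mass within tolerance.
TOY_DB = [500.0, 800.0, 1200.0, 1600.0, 2000.0]


def toy_search(values: Sequence[float], tol: float) -> List[Match]:
    hits = []
    for x in values:
        m = min(TOY_DB, key=lambda t: abs(x - t))
        if abs(x - m) / m <= tol:
            hits.append((x, m))
    return hits


assumed_ppm = [162.0, 166.0, 173.0, 175.0, 184.0]
observed = [m * (1.0 + p * 1e-6) for m, p in zip(TOY_DB, assumed_ppm)]
calibrated, history = refine(observed, toy_search, initial_tol=250e-6)
```

In this toy run, `history` holds the initial tolerance followed by one narrower value per round.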

In this way, the accuracy of candidate molecule identification can be enhanced. Namely, the accuracy of identification of unknown sample molecules can be improved.

The above-described procedures can be rendered into desired computer program information which can then be stored in various forms of information recording media, such as CD-ROMs, Floppy™ discs, or other forms of computer hardware, such as servers. In this way, the program can be executed on a desired computer system or a computer network (via information and communications technology).

EXAMPLES

The time-of-flight mass spectrometer is an apparatus for measuring the time it takes for an ion to travel a certain distance L in order to measure its mass according to the relationship between the mass m and the time of flight T expressed by the following equation (15):

T=L·(2eV)^(−1/2)·(m/z)^(1/2)  (15)

where e is the elementary charge and z is the charge number.

The mass measurement accuracy of this apparatus depends on L and the acceleration voltage V. L, which is an inherent value of the apparatus, may fluctuate due to temperature-caused expansions or contractions. V may fluctuate due to the drift in the supply voltage. Depending on the measurement conditions, these fluctuations may cause a systematic mass error of 100 ppm or more. However, variations among mass errors (which reflect the performance of the mass spectrometer) are relatively small as compared with the mean value of the systematic error. By taking advantage of this fact, the systematic error can be exclusively eliminated.
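How a small drift in L or V turns into a systematic mass error can be illustrated by evaluating equation (15) numerically. This is a sketch only: the flight length, voltage, drift sizes, and function names are hypothetical, and the instrument is assumed to convert flight time to m/z using the nominal L and V.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # one dalton in kilograms


def flight_time(mz_da: float, length_m: float, voltage_v: float) -> float:
    """Equation (15): T = L * (2eV)^(-1/2) * (m/z)^(1/2), m/z in Da."""
    return length_m * math.sqrt(mz_da * DALTON / (2.0 * E_CHARGE * voltage_v))


def mz_from_time(t_s: float, length_m: float, voltage_v: float) -> float:
    """Equation (15) solved for m/z, returned in Da."""
    return 2.0 * E_CHARGE * voltage_v * t_s ** 2 / length_m ** 2 / DALTON


# Hypothetical nominal settings and a hypothetical ion.
L_NOMINAL, V_NOMINAL, MZ_TRUE = 1.0, 20000.0, 1000.0

# Case 1: the flight path is actually 50 ppm longer than the nominal value.
t = flight_time(MZ_TRUE, L_NOMINAL * (1.0 + 50e-6), V_NOMINAL)
length_error_ppm = (mz_from_time(t, L_NOMINAL, V_NOMINAL) - MZ_TRUE) / MZ_TRUE * 1e6

# Case 2: the acceleration voltage is actually 100 ppm above the nominal value.
t = flight_time(MZ_TRUE, L_NOMINAL, V_NOMINAL * (1.0 + 100e-6))
voltage_error_ppm = (mz_from_time(t, L_NOMINAL, V_NOMINAL) - MZ_TRUE) / MZ_TRUE * 1e6
```

Because m/z scales with T² and T scales with L, a fractional change in L roughly doubles in the computed mass, while a fractional change in V appears with the opposite sign at about the same size.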

In the following, an example in which identification accuracy has been improved by the method of the invention will be described.

Example 1

One hundred fmol of tryptic digest of human serum albumin was measured by HPLC-MS/MS, and a database search was conducted by MS/MS ions search using the commercially available Mascot database search software (search parameters: peptide tolerance 250 ppm; and MS/MS tolerance 0.5 Da).

Based on the search results, the relative error E ((X−M)/M, in ppm) with respect to the theoretical m/z identified for the 20 ions with the highest scores was determined. The relative error E was then plotted with respect to the theoretical m/z, as shown in FIG. 1. As shown, the mean value of the original relative error E (indicated by ♦) was approximately 170 ppm, whereas the values of E varied only within the 150-175 ppm range, a spread smaller than the value of E per se.

The mass was calibrated by finding a least square line with respect to this group of ions and then subtracting it from the error in each ion. The relative error Ec after calibration (indicated by ▪) was similarly plotted, as shown in FIG. 1. The database search parameters determined from the variations in Ec (represented by the standard deviation) were such that the peptide tolerance was 18 ppm and the MS/MS tolerance was 0.080 Da. Thus, the mass calibration allowed the tolerances in a search to be reduced from 250 to 18 ppm and from 0.5 to 0.080 Da; namely, by a factor of approximately 14 and 6, respectively, thereby enhancing the identification reliability.

Example 2

The following shows that erroneous identification can actually be corrected by the mass calibration method of the invention.

A peptide SRLDQELK, which is known to be liable to erroneous identification during a database search based on mass data, was synthesized in a conventional manner. One hundred fmol of the peptide was then mixed with 100 fmol of the aforementioned tryptic digest of human serum albumin, and a similar experiment was conducted. Under the conventional search conditions (with search parameters of peptide tolerance 250 ppm and MS/MS tolerance 0.5 Da), the synthetic peptide was erroneously identified, as shown in FIG. 2.

When the above-described mass calibration was performed, the peptide was correctly identified, as shown in FIG. 3.

Each ion in the MS/MS spectrum of the peptide was assigned to a theoretical product ion (b and y ion sequences) of each peptide (EKLTQELK and SRLDQELK) that had been identified, and its systematic error was plotted with respect to the m/z, as shown in FIG. 4. In the case of SRLDQELK (indicated by ♦ in FIG. 4), the relative error of all of the ions was within a narrow range, whereas in the case of EKLTQELK (indicated by ▪ in FIG. 4), the plots exhibited two different distributions. Thus, by improving the mass accuracy by data processing, it became possible to correctly distinguish and identify peptides with similar masses and with identical sequences in the C-terminal portion.

INDUSTRIAL APPLICABILITY

In accordance with the invention, the calibration operation of the mass spectrometer prior to measurement, or the addition of an internal standard to a sample, can be eliminated, thereby enabling continuous operation of the mass spectrometer (without interruption by calibration operations). As a result, operators are freed from the burden of equipment adjustment, such that the efficiency of the molecule identification operation can be improved.

Furthermore, the influence of error inherent in a mass spectrometer can be eliminated, and a highly accurate and reliable biopolymer automatic identifying method can be implemented based solely on data processing. In a measurement system employing a plurality of mass spectrometers, uniform data accuracy can be obtained in individual mass spectrometers, thereby reliably preventing the erroneous identification of an unknown sample molecule.

1. A biopolymer automatic identifying method, comprising:
(a) obtaining a plurality of observed mass values, by subjecting a sample comprised of one or more biopolymers to MS/MS, producing candidate molecules;
(b) matching at least one of said observed mass values with a theoretical mass value, in a predetermined database of known mass values using a suitably programmed computing device, for candidate molecules, wherein one of said candidate molecules has a high similarity score such that it is thereby identified as an internal reference; then
(c) selecting at least one candidate molecule from (b) that has such a high similarity score using a suitably programmed computing device;
(d) calibrating said plurality of observed mass values with said internal reference to produce calibrated mass values using a suitably programmed computing device, wherein said internal reference is the theoretical mass of the selected candidate molecule or molecules in (c), and wherein each of said calibrated mass values is determined by the equation Xc=X/(1+(aX+b)), wherein Xc is a calibrated mass value, X is an observed mass value, a=Σ{(M−m_(M))×(E−m_(E))}/Σ{(M−m_(M))²}, b=m_(E)−a×m_(M), E=(X−M)/M, m_(E)=Σ(E)/n, and m_(M)=Σ(M)/n, wherein M is the theoretical mass value for said candidate molecule and n is the total number of candidate molecules;
(e) calculating relative error between said calibrated mass value of a candidate molecule in (d) and a theoretical mass value to determine the standard deviation of said relative error using a suitably programmed computing device;
(f) determining a tolerance of the matching step using said standard deviation (e) and a suitably programmed computing device, wherein said tolerance is determined by the equation Tc=K×S_(Ec), wherein K is 1.5 to 3.0; optionally,
(g) repeating steps (b)-(f) using a suitably programmed computing device; and then
(h) comparing said calibrated mass values to said predetermined database, thereby to determine the identity of at least one of said biopolymers using a suitably programmed computing device.
2. The biopolymer automatic identifying method according to claim 1, wherein said sample comprises more than one biopolymer.
3. The biopolymer automatic identifying method according to claim 1, wherein each mass value is matched with one candidate molecule.
4. The biopolymer automatic identifying method according to claim 1, further comprising communicating said identity to a display or to a computer storage medium.
5. The biopolymer automatic identifying method according to claim 1, wherein said calibrating step comprises: (A) calculating a relative error between said mass values and the theoretical mass in (d); (B) estimating a systemic error of said mass values by creating a least square line by plotting the theoretical mass in (d) against said relative error; and (C) subtracting said systemic error from said calibrated mass values, X_(c).
6. An information recording medium in which program information for causing a computer system to carry out the individual procedures of biopolymer automatic identifying method according to claim 1 is stored.